Improvements in Parsing the Index Thomisticus Treebank. Revision, Combination and a Feature Model for Medieval Latin
نویسندگان
چکیده
The creation of language resources for less-resourced languages like the historical ones benefits from the exploitation of language-independent tools and methods developed over the years by many projects for modern languages. Along these lines, a number of treebanks for historical languages started recently to arise, including treebanks for Latin. Among the Latin treebanks, the Index Thomisticus Treebank is a 68,000 token dependency treebank based on the Index Thomisticus by Roberto Busa SJ, which contains the opera omnia of Thomas Aquinas (118 texts) as well as 61 texts by other authors related to Thomas, for a total of approximately 11 million tokens. In this paper, we describe a number of modifications that we applied to the dependency parser DeSR, in order to improve the parsing accuracy rates on the Index Thomisticus Treebank. First, we adapted the parser to the specific processing of Medieval Latin, defining an ad-hoc configuration of its features. Then, in order to improve the accuracy rates provided by DeSR, we applied a revision parsing method and we combined the outputs produced by different algorithms. This allowed us to improve accuracy rates substantially, reaching results that are well beyond the state of the art of parsing for Latin.
منابع مشابه
Differentia compositionem facit. A Slower-Paced and Reliable Parser for Latin
The Index Thomisticus Treebank is the largest available treebank for Latin; it contains Medieval Latin texts by Thomas Aquinas. After experimenting on its data with a number of dependency parsers based on different supervised machine learning techniques, we found that DeSR with a multilayer perceptron algorithm, a right-to-left transition, and a tailor-made feature model is the parser providing...
متن کاملThe Development of the "Index Thomisticus" Treebank Valency Lexicon
We present a valency lexicon for Latin verbs extracted from the Index Thomisticus Treebank, a syntactically annotated corpus of Medieval Latin texts by Thomas Aquinas. In our corpus-based approach, the lexicon reflects the empirical evidence of the source data. Verbal arguments are induced directly from annotated data. The lexicon contains 432 Latin verbs with 270 valency frames. The lexicon is...
متن کاملThe Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon
We present an overview of the Index Thomisticus Treebank project (IT-TB). The ITTB consists of around 60,000 tokens from the Index Thomisticus by Roberto Busa SJ, an 11million-token Latin corpus of the texts by Thomas Aquinas. We briefly describe the annotation guidelines, shared with the Latin Dependency Treebank (LDT). The application of data-driven dependency parsers on IT-TB and LDT data is...
متن کاملThe Annotation Guidelines of the Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some specific Syntactic Constructions in Latin
The paper describes the treatment of some specific syntactic constructions in two treebanks of Latin according to a common set of annotation guidelines. Both projects work within the theoretical framework of Dependency Grammar, which has been demonstrated to be an especially appropriate framework for the representation of languages with a moderately free word order, where the linear order of co...
متن کاملFrom Syntax to Semantics. First Steps Towards Tectogrammatical Annotation of Latin
Assuming that collaboration between theoretical and computational linguistics is essential in projects aimed at developing language resources like annotated corpora, this paper presents the first steps of the semantic annotation of the Index Thomisticus Treebank, a dependency-based treebank of Medieval Latin. The semantic layer of annotation of the treebank is detailed and the theoretical frame...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010